Computer Methods and Programs in Biomedicine
Elsevier BV
Preprints posted in the last 7 days, ranked by how well they match Computer Methods and Programs in Biomedicine's content profile, based on 27 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.
Warnecke, J. M.; Baumgärtel, D.; Bollmann, J.; Deserno, T. M.
Background: Continuous health monitoring enables early detection of diseases and improves therapeutic outcomes. Non-intrusive biosignal sensors, such as capacitive ECG (cECG), offer a practical solution for daily monitoring in private environments, such as smart homes and vehicles. However, artifacts reduce signal quality and compromise reliability. Methods: Following a registered report protocol (Warnecke JM et al. PLoS One. 2021; 16(7):e0254780), we record data of 44 subjects and develop an artifact index for cECG. We use three signal quality indices (SQIs): the correlation of QRS complexes (corSQI), the R-peak detection consistency (bSQI) and the absolute amplitude ratio (aSQI). Our index classifies overlapping 10 s segments with a step width of 2 s as clean or artifact segments. We label a 2 s interval as an artifact if all five overlapping segments indicate artifacts. We record cECGs using an armchair with integrated electrodes in a single-arm study involving 44 subjects performing two activities, reading and watching television (TV), for 11 minutes each. We record a time-synchronized reference ECG with skin electrodes on the chest. To evaluate the artifact index, we compare it with manually generated ground truth. Moreover, we evaluate the clothing materials cotton, linen, jeans, and polyester in 5 subjects. Results: Watching TV results in longer, continuously clean signal durations than reading. On average, 88.3% of the signal during TV watching has a minimum continuous clean duration of 10 s, versus 79.8% during reading. All clothing configurations achieve a clean signal duration exceeding 10 s. Among the SQI metrics, bSQI performs best, achieving an accuracy of 90.7% and an F1 score of 79.9%. Combining the three SQI metrics in a voting approach improves accuracy to 92.0% and F1 score to 82.1%.
Discussion: Our artifact index automatically distinguishes clean from artifact cECG segments, promoting health monitoring in unsupervised real-world settings, earlier disease detection, and preventive health management. A limitation is the investigation of only two scenarios (reading and watching TV).
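The segment-to-interval rule described in this abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, the majority vote over the three SQIs is an assumption (the abstract only says "a voting approach"), and the handling of edge intervals covered by fewer than five segments is also assumed.

```python
def vote_segment(cor_bad: bool, b_bad: bool, a_bad: bool) -> bool:
    """Assumed majority vote: a segment is an artifact if at least two SQIs flag it."""
    return (cor_bad + b_bad + a_bad) >= 2


def label_intervals(segment_is_artifact, per_interval=5):
    """Map overlapping segment labels (10 s long, 2 s step) to 2 s interval labels.

    Segment k covers intervals k .. k+4, so interval j is covered by segments
    j-4 .. j. An interval is an artifact only if every covering segment is one.
    Assumption: edge intervals use whichever segments exist.
    """
    n_seg = len(segment_is_artifact)
    n_int = n_seg + per_interval - 1
    labels = []
    for j in range(n_int):
        first = max(0, j - per_interval + 1)
        last = min(n_seg - 1, j)
        labels.append(all(segment_is_artifact[first:last + 1]))
    return labels


segments = [False, True, True, True, True, True, False]
print(label_intervals(segments))
```

With five consecutive artifact segments, only the single 2 s interval shared by all five is labeled as an artifact, which shows how conservative the "all five" rule is.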
Chen, M.; Li, X.; Yang, K.; Taramasso, M.
**Abstract** **Background:** Transcatheter edge-to-edge repair (TEER) is an established treatment for mitral regurgitation but remains highly dependent on operator experience and complex transesophageal echocardiography (TEE)-guided intraprocedural imaging. Artificial intelligence (AI)-based semantic segmentation may improve procedural reproducibility and intraprocedural guidance; however, no TEER-specific segmentation framework has been reported. **Objectives:** To develop and evaluate AutoClip, a clinician-driven AI-guided TEE semantic segmentation model designed for simultaneous delineation of mitral valve anatomy and in-vivo TEER device components. **Methods:** A retrospective proof-of-concept study was conducted using 987 intraprocedural TEE frames derived from 10 video clips in 3 patients undergoing MitraClip G4 implantation. Seven semantic labels, including mitral leaflets and device components, were manually annotated using ITK-SNAP. Following standardized preprocessing and region-of-interest extraction, an Attention U-Net architecture was trained frame-wise on bicommissural and corresponding X-plane TEE views. Model performance was assessed using mean intersection-over-union (IoU) and Dice coefficient on an independent test set. **Results:** The Attention U-Net demonstrated improved sensitivity to small device structures compared with conventional U-Net architectures. Preliminary training performance achieved a mean IoU of approximately 0.93, while independent test performance reached a mean IoU of 0.46 across foreground classes. Qualitative assessment demonstrated feasible simultaneous segmentation of mitral leaflets, clip arms, grippers, and delivery shaft during TEER procedures. **Conclusions:** AutoClip represents a proof-of-concept TEER-specific TEE semantic segmentation framework initiated through a clinician-oriented workflow without formal computer science expertise. 
Although preliminary accuracy remains modest due to limited sample size, this study establishes a reproducible pathway for future AI-assisted intraprocedural guidance systems and larger multicenter development efforts in structural heart interventions.
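For readers unfamiliar with the two reported metrics, the following sketch shows how per-class IoU and Dice can be computed from label maps and averaged over foreground classes. The function names and the toy label maps are illustrative assumptions; the abstract does not describe the authors' evaluation code.

```python
def iou_and_dice(pred, truth, label):
    """Return (IoU, Dice) for one class label, or None if the class is absent in both maps."""
    p = [v == label for row in pred for v in row]
    t = [v == label for row in truth for v in row]
    inter = sum(a and b for a, b in zip(p, t))
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - inter
    if union == 0:
        return None
    return inter / union, 2 * inter / (p_sum + t_sum)


def mean_foreground_iou(pred, truth, labels):
    """Average IoU over the foreground labels that occur in pred or truth."""
    scores = [iou_and_dice(pred, truth, k) for k in labels]
    ious = [s[0] for s in scores if s is not None]
    if not ious:
        raise ValueError("no foreground class present")
    return sum(ious) / len(ious)


# Toy 2x3 label maps: 0 = background, 1 and 2 = hypothetical foreground classes.
pred = [[0, 1, 1], [0, 2, 2]]
truth = [[0, 1, 0], [0, 2, 2]]
print(iou_and_dice(pred, truth, 1))
print(mean_foreground_iou(pred, truth, [1, 2]))
```

For a single class the two metrics are linked by Dice = 2·IoU / (1 + IoU), so a mean IoU of 0.46 corresponds to a noticeably higher Dice value.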
Rajeev, M.; Narayan, A.
Background: Unstructured data represent about 80% of total electronic health records (EHR) data. Structuring this free text is essential for advancing clinical research, including cohort selection for trials, retrospective studies, and the development of disease registries. While manual chart review (MCR) remains the gold standard for extracting this clinical data, the process is inherently slow, resource-intensive, and susceptible to errors from human fatigue. We evaluated the extraction accuracy, safety, and efficiency of the HeLIX (Hepatology Logic-Integrated Extraction) framework, a Large Language Model (LLM) protocol using Google Gemini 3 Pro, compared to a gold-standard Manual Chart Review (MCR). Methods: A prospective validation study was conducted using 50 high-complexity, simulated hepatology discharge summaries designed to replicate the real-world heterogeneity of EHRs. The HeLIX framework employed a Zero-Shot, Structured Chain-of-Thought (CoT) prompting strategy enforced by a three-layer architecture: Clinical Reasoning Trace, Schema Enforcement, and Evidence Verification. The model extracted 45 distinct clinical variables. Performance was benchmarked against a consensus MCR. Results: Across 2,250 evaluated data points, the model achieved an overall Extraction Accuracy of 99.24% (95% CI: 98.8%-99.5%), with perfect concordance in 35/45 (77.8%) variables. For binary diagnostic variables, the model demonstrated an overall F1-score of 0.98, Recall of 0.99 and substantial inter-rater reliability (Cohen's κ = 0.97). Hallucinations were exceptionally rare (2/2250; 0.08%). Critical errors affecting clinical management occurred in only 2 instances (<0.1% of total data), both involving etiological misattribution in complex multifactorial diagnoses. The AI workflow was 13.4-fold faster and 95.1% more cost-effective than manual extraction.
Conclusion: The HeLIX framework demonstrates physician-level accuracy and reliability in extracting complex hepatology data. It offers a scalable, efficient, and economical alternative to manual chart review. Such frameworks could accelerate clinical research, enabling healthcare systems globally to build comprehensive patient registries for a fraction of the traditional cost.
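As a rough check on the reported accuracy and its confidence interval, the sketch below computes a Wilson score interval for a binomial proportion. Two assumptions are made here: the count of 2,233 correct extractions is inferred from 99.24% of 2,250 and is not stated in the abstract, and the Wilson method is only one common choice, since the abstract does not say which interval the authors used.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# 2233 is an inferred count (99.24% of 2,250), not a figure from the abstract.
low, high = wilson_interval(2233, 2250)
print(f"accuracy {2233 / 2250:.2%}, interval {low:.1%} to {high:.1%}")
```

Under these assumptions the interval lands close to the one reported, which suggests the published bounds are consistent with a standard binomial interval over all data points.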
Landry, T. C.; Kim, Y.
Background. Capillary refill time, an examiner-dependent bedside test of distal microvascular perfusion, has become a resuscitation target in septic shock [1,2,3,4], motivating a continuous surrogate computed from the photoplethysmogram (PPG, the optical waveform the pulse oximeter on every ICU patient already records) [5,6,7,8]. Objective. We attempted three PPG-derived candidate measures on the MIMIC-IV Waveform Database (MIMIC-IV-WDB v0.1.0) and asked, by inspecting randomly drawn examples, whether each captured its intended physiology before any downstream modeling. Methods. MIMIC-IV-WDB v0.1.0 [9] was linked to MIMIC-IV [10]. The signals were a cuff-anchored perfusion-index recovery (reactive hyperemia when the cuff shares an arm with the probe), a slow Mayer-wave-band power ratio of the perfusion index (sympathetic vasomotor tone), and a per-beat diastolic exponential decay time constant (a refill-like recovery time). For each signal we drew 10 random examples at a fixed seed and checked them against a checklist fixed in advance. Each was read by the author and, separately, by MedGemma 1.5, a multimodal medical language model run locally. A synthetic test with a known time constant checked the third signal. Results. The cuff-anchored signal showed the expected occlusion-reperfusion shape on 268 of 6,236 evaluable cuff cycles (4.30%) in 15 of 19 patients, consistent with opposite-limb placement of the probe and cuff. The slow-band ratio returned a stable cohort value, but a clear, stationary peak appeared in only 4 of 10 random windows. The per-beat fit met its goodness-of-fit threshold in 10 of 10 beats, yet a cardiac-frequency heuristic flagged a possible fit on the heart-rate oscillation in 7 of 10, and in 5 of 17 patients the time constant lay where an exponential is indistinguishable from a straight line. A 0.5 Hz high-pass pre-filter implanted its own approximately 318 ms time constant regardless of truth.
The language model tracked the human on clear positives but reported the pattern present on every call it returned, never absent. Conclusions. Two of the three candidate signals did not reflect their intended physiology in most examples, and the third was constrained by sensor placement. Inspecting a few random raw inputs against a checklist written in advance is an inexpensive upstream check before downstream inference on PPG-derived microvascular signals.
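The approximately 318 ms figure is consistent with the time constant of a first-order high-pass filter with a 0.5 Hz cutoff. The filter order is an assumption here, since the abstract does not state it; the symbols $\tau$ (time constant) and $f_c$ (cutoff frequency) are introduced only for this illustration.

```latex
\tau = \frac{1}{2\pi f_c} = \frac{1}{2\pi \times 0.5\,\mathrm{Hz}} \approx 0.318\,\mathrm{s}
```

Any exponential fitted to a signal that has passed through such a filter can therefore recover the filter's own decay rather than a physiological one.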
Mayar, S.; Henriksen, M.; Christensen, R.; Hansen, P.; Bliddal, H.; Nybing, J. U.; Nielsen, C. T.; Gudbergsen, H.; Boesen, M. P.; Brejnbol, M. W.
Background and rationale: Knee osteoarthritis (KOA) is a leading cause of lower limb disability worldwide, characterized by functional limitations, stiffness and pain. The incidence of KOA is especially tied to age and obesity. It is a disabling disease that often makes patients less physically active, thus increasing the risk of other diseases and mortality [1]. The clinical diagnosis of KOA is based on the symptoms and functional limitations of the joint. The diagnosis is usually supported with a radiograph (X-ray) of the weight-bearing knee. Radiographic features, such as Kellgren-Lawrence grade, are used as eligibility criteria for clinical studies while other features, such as joint space width (JSW), are used as endpoints for structural KOA progression [2,3]. While the use of these radiographic features is standard in academia, the use of JSW as a structural biomarker has received criticism. Critics point out that JSW is an indirect and projection-dependent measure of cartilage deterioration which is sensitive to technical factors such as the angulation of the X-ray beam and the positioning of the knee. Small differences in these factors can alter the measured joint space and may not reflect true disease progression [4,5]. Despite limitations, minimum joint space width (mJSW) remains one of the most widely used structural biomarkers in KOA trials and is currently one of the only structural imaging measures accepted in regulatory guidance as evidence of disease modification in OA drug development [3]. For JSW to be reliable and consistent in determining the advancement of KOA, the use of fixed-flexion devices is crucial to reduce the risk of unwanted narrowing or widening of the radiographic joint space width [6,7]. The LOSEIT trial, which the present study is based on, acknowledges the angulation problem and uses a standard clinical fixed-flexion device in weight-bearing PA views to get reliable JSW results [8].
Historically, a radiologist would draw on and grade radiographs of the knee joint to extract the features. However, manual reading and annotation is time-consuming with notable interobserver variance [9]. With increasing computational power and the use of deep neural networks, off-the-shelf artificial intelligence (AI) tools have become available for automatic extraction of radiograph features. Automation would free up time from radiologists and provide more consistent measurements due to the reproducible nature of the models [10]. These tools have received regulatory approval for commercial use; however, regulatory approval does not guarantee uniform or bias-free performance when used on real-world data [11]. Furthermore, in a large multi-hospital chest X-ray study, Zech et al. showed that convolutional neural networks achieved worse results on data from other hospitals than on data from the original hospitals in which they were tested [12]. This highlights the risk of overestimating the accuracy of AI tools when only internally validated. It is therefore apparent that external validation is required when testing these AI models. Objectives: The aim of this analysis is to evaluate the agreement of a commercially available AI tool for measuring JSW with the best practice radiologist annotation in the tibiofemoral joint of the knee in radiographs stabilized with a fixed-flexion device and acquired as part of a clinical trial. Methods: This study is a secondary analysis of the data from the LOSEIT trial, a randomized, double-blind, placebo-controlled, single-center trial, where patients were randomized to either liraglutide or identically appearing placebo after an initial weight-loss period to investigate the effects on KOA. Radiographs of the tibiofemoral joint were acquired at enrollment (week -8) and at end-of-trial (week 52) for a total acquisition-to-acquisition time of 60 weeks [13].
The primary analysis will assess agreement between AI-derived and reference-derived change in JSW from enrolment to follow-up. Change will be calculated as follow-up minus enrolment separately for the AI tool and the reference measurement. The main measure of interest will be the change in medial minimal JSW (mmJSW), with change in lateral minimal JSW (lmJSW), medial fixed JSW (mfJSW) and lateral fixed JSW (lfJSW) as secondary measures. This study will follow an equivalence framework using the two one-sided tests (TOST) approach with a Bland-Altman analysis as the main outcome. The equivalence margin will be set at δ = 0.5 mm. Agreement consistent with equivalence will be considered established if the upper limit of the 95% confidence interval (95% CI) for the upper limit of agreement (LoA) and the lower limit of the 95% CI for the lower LoA are within the established margins. The reference JSW will be the average measurement of two independent resident radiologists. If there is a mismatch in the measurements of more than 0.40 mm between the two radiologists, the radiologists will re-annotate the case independently. If the difference remains greater than 0.40 mm, a musculoskeletal radiology consultant will review the radiograph and establish the reference JSW. The index test will be the measurements output by the AI tool. Populations: Patients aged 18 to 74 with symptomatic knee osteoarthritis, radiographically confirmed KL grade 1-3, with a BMI ≥27, motivated for weight loss and in accordance with the LOSEIT trial inclusion criteria. Further statistical details. Sample size: Not applicable as this is a secondary analysis. Framework: This is an agreement study assessing the equivalence of a commercially available AI tool for radiographic evaluation of knee osteoarthritis with best practice radiologist measurements. Confidence intervals and P values: All 95% confidence intervals and P-values will be two-sided.
Statistical software: SAS Studio and/or R version 4.2.2 (or newer).
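The equivalence criterion described in this protocol (outer confidence bounds of both limits of agreement inside ±0.5 mm) can be sketched as follows. This is an illustrative Python outline, not the planned SAS or R analysis: the function name and toy data are hypothetical, the standard error uses the common Bland-Altman approximation, and a normal quantile stands in for the t quantile that a real analysis with small samples would normally use.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def loa_equivalence(ai_change, ref_change, margin=0.5, level=0.95):
    """Bland-Altman limits of agreement with approximate CIs, checked against a margin (mm)."""
    diffs = [a - r for a, r in zip(ai_change, ref_change)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired measurements")
    bias, sd = mean(diffs), stdev(diffs)
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    lower_loa, upper_loa = bias - z * sd, bias + z * sd
    # Approximate standard error of a limit of agreement (normal-theory formula).
    se_loa = sd * sqrt(1 / n + z ** 2 / (2 * (n - 1)))
    lower_ci = lower_loa - z * se_loa  # outer bound of the lower LoA
    upper_ci = upper_loa + z * se_loa  # outer bound of the upper LoA
    return {
        "bias": bias,
        "loa": (lower_loa, upper_loa),
        "outer_bounds": (lower_ci, upper_ci),
        "equivalent": lower_ci > -margin and upper_ci < margin,
    }


# Toy paired changes in mm (made-up numbers, for illustration only).
ai = [0.01, -0.01, 0.02, -0.02, 0.0, 0.01, -0.01, 0.0]
ref = [0.0] * 8
print(loa_equivalence(ai, ref)["equivalent"])
```

The check is deliberately strict: a small bias is not enough, because the spread of the differences and the uncertainty in the limits themselves must both fit inside the margin.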
Uskova, N. G.; Gombolevskiy, V. A.; Chernina, V. Y.; Burenchev, D. V.; Akhaladze, D. G.; Panina, E. V.; Karachunskiy, A. I.; Tereschenko, G. V.; Goncharov, M. Y.; Soboleva, E. A.; Konopleva, E. I.; Bydanov, O. I.; Plekhov, S. Y.; Grachev, N. S.
Background. Lung metastases in osteosarcoma (OS) are the main cause of death. The accuracy of the diagnosis of nodules by computed tomography (CT) of the lungs is critically important for determining the disseminated stage of the disease and planning surgical treatment. The use of artificial intelligence (AI) in the search for lung nodules increases the accuracy of diagnosis and reduces the chance of missing metastases. Objective: to evaluate the accuracy of lung nodule diagnosis in adolescents with OS using AI. Methods. A retrospective assessment of CT scans of adolescents with OS was performed. A pathological nodule with an average size of ≥4 mm was considered a target finding. The diagnostic accuracy of an AI algorithm previously trained on an adult dataset was evaluated, and the number of false positives (FP) and false negatives (FN) was determined. Sensitivity, specificity, accuracy, area under the ROC curve (AUC), positive predictive value, negative predictive value, and F1-measure were calculated. Based on the obtained results, the effectiveness of the algorithm was assessed. Results. 248 CT scans of adolescents with OS were evaluated. The following results were obtained: in 5 cases, the AI algorithm showed a FP result (2.02%), in 34 cases, it showed a FN result (13.71%), and in 209 cases, a correct result (both true positive and true negative) (84.27%). The diagnostic accuracy of the algorithm was 0.843 (95% CI 0.794-0.887). Applying the AI algorithm in a radiologist's practice for this specific clinical task would increase the sensitivity from 0.805 to 0.891, with an absolute decrease in the number of FN results of 8.59% and a relative decrease of 44%. Conclusion. The obtained results confirm the practical value of the application of the AI algorithm and justify the implementation of AI-assisted systems in the diagnostic protocols for lung metastases in adolescents with OS.
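The headline percentages in this abstract follow from simple ratios, which the sketch below reproduces. The function name is hypothetical, and the inputs are the rounded figures printed in the abstract, so the absolute decrease comes out as 8.6 percentage points instead of the reported 8.59 (presumably computed from unrounded sensitivities).

```python
def false_negative_reduction(sens_before, sens_after):
    """Absolute and relative drop in the false-negative rate (1 - sensitivity)."""
    fn_before = 1 - sens_before
    fn_after = 1 - sens_after
    absolute = fn_before - fn_after
    relative = absolute / fn_before
    return absolute, relative


total, fp, fn, correct = 248, 5, 34, 209
assert fp + fn + correct == total  # the three reported counts cover every scan

print(f"accuracy = {correct / total:.3f}")
print(f"FP share = {fp / total:.2%}, FN share = {fn / total:.2%}")

absolute, relative = false_negative_reduction(0.805, 0.891)
print(f"absolute FN decrease = {absolute:.1%}, relative = {relative:.0%}")
```

The relative figure is the more informative one for clinical reading: roughly four in ten missed nodules would be recovered under the stated sensitivities.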
Leonhardt, R.; Lindemann, U.; Schneider, M.; Rapp, K.; Klenk, J.
Background: Wheeled walkers can improve safety during walking, but improper use may increase fall risk among frail older adults. No suitable tool exists to assess safe indoor wheeled walker use in this population. This study aimed to develop and validate a video-based expert assessment tool. Methods: Based on the literature and expert consensus, seven problematic indoor situations were identified, and an assessment tool with five safety criteria per situation was developed (maximum score = 35). Fifty participants (mean age 83.9 years, 64% women) from a geriatric rehabilitation clinic and a nursing home were video-recorded while using a rollator. Expert ratings were compared with nursing staff ratings, self-ratings, and the Timed Up and Go test to evaluate validity. Intra- and inter-rater reliability were determined from independent ratings by two physiotherapists and a repeated expert rating after seven days. Sensitivity to change was assessed after two weeks of rehabilitation, and feasibility by the time required for assessment. Results: The expert score of rater 1 at baseline was 28.5 points, and assessment required a mean of 17.5 minutes. Intra-rater reliability was excellent (ICC = 0.98) and inter-rater reliability was good (ICC = 0.80). Validity analyses showed the strongest association with nursing staff assessments (r = 0.74) and a moderate association with the Timed Up and Go test (r = -0.45). After two weeks, patients improved by an average of 2.38 points (8.4% of baseline score). Conclusions: The new instrument demonstrated high reliability, acceptable validity, sensitivity to change, and good feasibility for assessing safe wheeled walker use in frail older adults. Trial registration number and date of registration: DRKS00038358, 07/11/2025
Kadivar, M.; Alyamani, M.; Mori, M.; Kadivar, M.; Jonsson, J.; Hertervig, E.; Grip, O.; Svensson, L.; Erjefalt, J. S.; Marsal, J.
Background: Histological examination of mucosal tissue in inflammatory bowel diseases (IBD) is a sensitive tool to measure disease activity, and histological remission is emerging as a potentially important treatment target. There are several existing histopathological indices, but they often encompass caveats such as not primarily having been designed to measure the degree of inflammation, encompassing subjective components with poor intra- and interindividual reproducibility, and requiring expert pathologists who are scarce, thus resulting in extended response times. Aim: To construct a new computerized, automated index to objectively measure histological disease activity in the ileal and colonic mucosa, applicable to both Crohn's disease (CD) and ulcerative colitis (UC). Materials and methods: Ileocolonic biopsies were collected from control subjects and patients with CD or UC. A group of CD patients was sampled before and after 12 weeks of anti-TNF therapy. Another group of CD and UC patients functioned as a small validation cohort. Epithelial cells, neutrophils, macrophages, and T cells were immunohistochemically stained, followed by digitalization of the color signal and computerized delineation of the epithelial and lamina propria compartments. The various immune cell types within the epithelium and the lamina propria, respectively, were enumerated, and the numbers were compared between control subjects and patients with CD or UC. Results: The numbers of neutrophils and macrophages in the epithelium, and neutrophils in the lamina propria, showed the highest sensitivity and specificity for distinguishing control-subject tissues from CD and UC tissues. These three parameters were thus chosen to construct a new index, named QiC3 1.0, that could separate tissues from control subjects and patients with CD or UC with high precision. It performed equally well in a small validation cohort of patients. 
The QiC3 index correlated well with previously described histopathological indices, fecal calprotectin, and endoscopic scores in UC, but showed worse correlation with endoscopic scores in CD and symptomatic scores. When applying the new index to tissues from CD patients before and after therapy, it showed good responsiveness, demonstrating a distinct amelioration in the microscopic inflammatory status that corresponded well to improvements in histopathological scores. Conclusion: We describe a new quantitative, computerized, automated, non-subjective, and response-sensitive immunohistological index (QiC3) for measuring disease activity in ileal and colonic mucosal biopsies, suitable for both CD and UC.
Wang, E.; Grenier, K.; Savadjiev, P.; Poenaru, D. D.
Background. Definitive diagnosis of Hirschsprung's disease (HD) requires pathological identification of enteric ganglion cells. This process is time-consuming and subject to inter-observer variability. Artificial intelligence (AI) tools have the potential to standardize and accelerate this workflow, but no study has determined which AI approach best serves intraoperative HD pathology diagnostics. Method. This study compared the U-Net and You Only Look Once version 26 (YOLO26) frameworks for ganglion cell detection using a single-centre retrospective dataset of 54 whole-slide images (WSIs) from rectal biopsies. WSIs were tiled into 397,731 image patches (128x128 pixels), further partitioned into training (70%), validation (15%), and testing (15%) sets. Models were evaluated on tile- and patient-level diagnostic metrics and processing latency. Results. The U-Net achieved a tile-level sensitivity of 82.9%, showing no statistically significant difference compared to YOLO26 (79.1%; p = 0.097). However, YOLO26 demonstrated a statistically significant advantage in tile-level specificity (96.1% vs. 93.9%; p < 0.001) and reduced mean inference latency (7.64 ms vs. 11.57 ms/tile). At the patient level, both models achieved 100% diagnostic sensitivity. Despite low patient-level specificity (0.0% U-Net; 11.8% YOLO26), the tissue-level diagnostic burden of false positives was 6.00% for U-Net and 3.50% for YOLO26. Conclusion. The U-Net is preferred when nominal gains in sensitivity are prioritized, while the YOLO26 is an alternative that optimizes efficiency and false positive suppression. Both models serve as robust screening filters to augment the pathologist's workflow and should be selected based on workflow requirements. Prospective validation on larger, multi-centre datasets is required before clinical implementation.
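The gap between tile-level and patient-level specificity is easier to see with a small aggregation sketch. The rule used here, flagging a patient as soon as a minimum number of tiles is positive, is an assumption for illustration; the abstract does not state how tile predictions were combined, and the toy data below are made up.

```python
def patient_call(tile_predictions, min_positive_tiles=1):
    """Flag a patient when at least `min_positive_tiles` tiles are predicted positive."""
    return sum(tile_predictions) >= min_positive_tiles


def sensitivity_specificity(calls, truths):
    """Patient-level sensitivity and specificity from boolean calls and ground truth."""
    tp = sum(c and t for c, t in zip(calls, truths))
    tn = sum((not c) and (not t) for c, t in zip(calls, truths))
    fp = sum(c and (not t) for c, t in zip(calls, truths))
    fn = sum((not c) and t for c, t in zip(calls, truths))
    sens = tp / (tp + fn) if tp + fn else None
    spec = tn / (tn + fp) if tn + fp else None
    return sens, spec


# Toy data: two truly positive patients, two truly negative ones.
tiles = {
    "p1": [0, 1, 1, 0],
    "p2": [1, 0, 0, 0],
    "p3": [0, 0, 1, 0],  # a single false-positive tile
    "p4": [0, 0, 0, 0],
}
truth = {"p1": True, "p2": True, "p3": False, "p4": False}

for k in (1, 2):
    calls = [patient_call(tiles[p], min_positive_tiles=k) for p in tiles]
    print(k, sensitivity_specificity(calls, [truth[p] for p in tiles]))
```

With thousands of tiles per slide, even a tile-level specificity above 90% makes at least one false-positive tile per patient very likely, which is one plausible reading of the low patient-level specificity reported.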
Seidel, A.; Steiger, E.; Schuster, J.; Kroll, L. E.
Background: Digital decision-support tools such as triage systems and symptom checkers support millions of health-related decisions each year. Their quality and safety are commonly evaluated using textual patient cases, known as case vignettes. However, existing vignette sets written by medical experts cover only a limited spectrum of real-world patient presentations and lack population weights, which would allow extrapolating evaluation results to the underlying patient population. Objective: This study aims to develop a data-driven framework for automatically generating a human-manageable set of case vignettes from nationwide triage data that captures broad presentation diversity and links each vignette to a quantitative weight reflecting the number of underlying patient assessments. Methods: From 3.2 million triage assessments conducted over one year using structured triage software in the German medical on-call service (telephone triage and online self-triage) and at the joint contact points of the outpatient emergency care service and hospital emergency departments, we randomly sampled 50,000 cases. Triage questionnaires were converted into semantic embeddings using a German Sentence Transformer Model and grouped by agglomerative clustering. For clusters containing sufficient assessments, we generated one representative assessment using a two-phase simulated-annealing optimization. The optimization minimized the distance to the cluster centroid while maximizing the number of answered triage questions, aiming for high representativeness and information content. Each representative assessment was assigned the size of its source cluster as its sample-based weight. A similarity-based sensitivity analysis was performed to examine whether these weights were preserved in the full 1-year population. 
Finally, the question-answer pairs of the representative assessments were converted into structured textual case vignettes using controlled prompting of a large language model. Results: The cluster analysis yielded 514 included clusters covering 96.8% of the sampled 50,000 assessments. The generated representatives showed strong agreement with the majority treatment-urgency recommendation of their source cluster (Spearman's ρ=0.78, p<0.001) and contained on average 4.3 more answered triage questions than the original assessments within their clusters. When weighted by cluster size, the representatives approximated the sample distributions of treatment urgency, demographics, and symptoms, although some systematic deviations remained, most notably an overrepresentation of female cases (+13.5%), patients aged 14-49 years (+8.0%), and the urgency category "As soon as possible" (+6.6%). Of 121 recorded symptoms, 101 (83.5%) were covered by the representatives; the rest each occurred in <0.5% of the sample. In a sensitivity analysis, cluster-based vignette weights were strongly correlated with similarity-based population weights (Spearman's ρ=0.77, p<0.001), and 90.1% of assessments in the full 1-year population were matched to at least one vignette. Conclusions: We present a data-driven framework for deriving a manageable set of population-weighted case vignettes from nationwide triage data. The resulting vignettes captured broad presentation diversity, approximated key sample characteristics, and provided an explicit quantitative link to the number of underlying patient assessments. After medical expert review and refinement, the vignettes may support more population-aware evaluation and quality assurance of digital decision-support tools.
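The weighting idea (one representative per sufficiently large cluster, weighted by cluster size) can be illustrated with a simplified sketch. Note the substitutions: cluster labels are taken as given instead of coming from agglomerative clustering on sentence embeddings, and the representative is simply the existing member nearest the centroid, a stand-in for the authors' two-phase simulated-annealing generation. The function name, `min_size` parameter and toy vectors are hypothetical.

```python
from collections import defaultdict
from math import dist


def weighted_representatives(embeddings, labels, min_size=2):
    """Pick, per cluster, the member nearest the centroid and weight it by cluster size."""
    clusters = defaultdict(list)
    for idx, lab in enumerate(labels):
        clusters[lab].append(idx)

    result = {}
    for lab, members in clusters.items():
        if len(members) < min_size:
            continue  # clusters with too few assessments yield no vignette
        dim = len(embeddings[members[0]])
        centroid = [
            sum(embeddings[i][d] for i in members) / len(members) for d in range(dim)
        ]
        rep = min(members, key=lambda i: dist(embeddings[i], centroid))
        result[lab] = (rep, len(members))  # (representative index, weight)
    return result


# Toy 2-D "embeddings" with precomputed cluster labels.
emb = [[0, 0], [1, 0], [2, 0], [10, 10], [10, 11], [10, 12], [50, 50]]
lab = [0, 0, 0, 1, 1, 1, 2]
reps = weighted_representatives(emb, lab)
coverage = sum(w for _, w in reps.values()) / len(emb)
print(reps, f"coverage = {coverage:.1%}")
```

The coverage figure plays the same role as the 96.8% reported in the abstract: it is the share of sampled assessments that belong to a cluster large enough to receive a representative.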
Jean, A.; Merceron, A.; Le Saux, A.; Mercier, E.; Benillouche, P.
This study aims to assess women's perceptions of artificial intelligence (AI) used in breast cancer screening in France by examining their knowledge of AI and the barriers to their participation in organized screening. The results of a survey conducted in June 2025 among a national sample of 2000 women (aged 40-75) reveal limited participation and persistent concerns among women. Nevertheless, despite a low awareness of specific AI applications, a large majority of the women surveyed are very favorable to the use of AI in breast cancer diagnosis, even considering it a lever to increase screening participation.
Tharzeen, A.; Vafaei Sadr, A.; Radfar, N.; Hwang, W.; Abedi, V.; Zand, R.
Background: Machine learning models for stroke mortality prediction typically treat each time horizon independently and use flat tabular features that ignore the relational structure of electronic health records (EHRs). In this pilot study, we leveraged graph-based machine learning models to predict post-stroke all-cause mortality across three different time horizons. Methods: We developed Stroke Temporal Heterogeneous Graph (StrokeTHG), a heterogeneous graph neural network model for simultaneous multi-horizon stroke mortality prediction (30-day, 90-day, 1-year) using EHR data from Penn State Health System. The model encodes various relations among EHR entities (e.g., patient, diagnosis, comorbidity) and temporal encoding of admission time to better predict stroke mortality. We compared our proposed approach against various baseline methods, including Logistic Regression, Random Forest, and XGBoost. We also performed ablation and subgroup analyses, evaluated the quality of learned graph embeddings, and assessed the importance of different edge types in the graph. Results: We included 4,144 stroke patients (mean age 69.2 years; 54.3% men), of whom 3,332 (80.4%) survived their stroke after one year. 30-day, 90-day, and 1-year mortality rates were 9.7%, 13.7%, and 19.6%, respectively. Our proposed approach, StrokeTHG, achieved AUROC of 0.872, 0.878, and 0.837 across horizons, outperforming all tabular baselines. At ≥75% specificity, the model identified 5-10 percentage points more mortality cases than the best baseline at each horizon. Subgroup analysis demonstrated consistent performance across sex subgroups and the largest discriminative gains in the Age 65-80 stratum. Edge-type ablation identified phenotype-patient and admission-patient edges in the constructed EHR graph as the most influential relational edges for mortality prediction.
StrokeTHG embeddings outperformed all graph and matrix factorization baselines under an identical downstream classifier, confirming that performance gains stem from representation quality rather than classifier capacity. Conclusions: StrokeTHG demonstrates that heterogeneous graph representations of EHR data provide a consistent improvement over flat tabular models for multi-horizon stroke mortality prediction, with particular advantage at clinically actionable sensitivity thresholds and novel multi-horizon monotonic prediction capability. This methodological framework may be adaptable to other EHR-based clinical research studies seeking to leverage heterogeneous relational structures for predictive modeling.
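The "sensitivity at a minimum specificity" comparison mentioned in the results can be sketched as a threshold sweep over predicted risks. The function name and the toy scores are illustrative assumptions; the abstract does not describe how the authors selected operating points.

```python
def sensitivity_at_specificity(scores, labels, min_specificity=0.75):
    """Best sensitivity over all thresholds whose specificity meets the minimum.

    A case is predicted positive when its score is >= the threshold.
    """
    neg = [s for s, y in zip(scores, labels) if y == 0]
    pos = [s for s, y in zip(scores, labels) if y == 1]
    if not neg or not pos:
        raise ValueError("need at least one positive and one negative case")

    best = 0.0  # a threshold above every score gives specificity 1 and sensitivity 0
    for thr in sorted(set(scores)):
        spec = sum(s < thr for s in neg) / len(neg)
        if spec >= min_specificity:
            sens = sum(s >= thr for s in pos) / len(pos)
            best = max(best, sens)
    return best


# Toy risk scores: label 1 = died within the horizon, 0 = survived.
scores = [0.1, 0.2, 0.3, 0.4, 0.25, 0.8, 0.9, 0.6]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
print(sensitivity_at_specificity(scores, labels))
```

Comparing models at a fixed specificity, as done in the abstract, answers a more clinical question than AUROC alone: how many additional deaths are identified for the same false-alarm burden.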
Molla, A. R.; Maity, A.; Saha, S.; Bhattacharya, R.; Chakraborty, A.; Biswas, S.; Nath, S.
Skin cancer requires early detection for improved survival rates. Most existing methods rely on deep learning based image classification, which is affected by visual similarity among lesions. Fewer studies use Gene Expression (GE) analysis, which captures molecular characteristics but lacks structural and visual details. To overcome limitations of individual modalities, this paper proposes a multimodal framework integrating dermoscopic images and GE profiles for skin cancer classification. EfficientNet and logistic regression are used for image based analysis and genomic skin lesion profiling, respectively, followed by fuzzy rule based decision systems to reduce uncertainty within individual modalities. Finally, fuzzy fusion combines predictions from both modalities using uncertainty based weighting of classifier outputs. The experimental findings show that both the image based and GE based classification models individually achieved accuracies of nearly 92%. However, the integration of prediction results through the proposed fuzzy fusion strategy further enhanced the classification performance, achieving an overall accuracy of 94.25%. The results obtained outperform contemporary methods, highlighting the effectiveness of combining complementary multimodal information compared with single modality approaches.
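One simple way to picture uncertainty-based weighting of two classifiers is to weight each modality by how far its predicted distribution is from uniform. The sketch below uses normalized entropy for this; it is a stand-in for illustration only and is not the fuzzy rule-based system described in the abstract. All function names and the example probabilities are hypothetical.

```python
from math import log


def certainty(probs):
    """1 minus normalized Shannon entropy: 0 for a uniform guess, 1 for a one-hot output."""
    h = -sum(p * log(p) for p in probs if p > 0)
    return 1 - h / log(len(probs))


def fuse(p_image, p_gene):
    """Certainty-weighted average of two class-probability vectors."""
    w_i, w_g = certainty(p_image), certainty(p_gene)
    if w_i + w_g <= 0:
        w_i = w_g = 0.5  # both modalities uninformative: fall back to a plain average
    total = w_i + w_g
    return [(w_i * a + w_g * b) / total for a, b in zip(p_image, p_gene)]


# The image model is undecided, the gene-expression model is confident.
fused = fuse([0.5, 0.5], [0.9, 0.1])
print(fused, "-> class", fused.index(max(fused)))
```

The design intent matches the abstract's motivation: when one modality is ambiguous (for example, visually similar lesions), the fused decision leans on the other.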
Bunuel-Muriscot, A.; Gonzalez-Crespo, I.; Otero-Casal, P.; Gomez-Caamano, A.; Pardo-Montero, J.
The purpose of this work is to analyze the 2-year overall survival (OS2y) of limited-stage small cell lung cancer (LS-SCLC) treated with chemoradiotherapy (CRT), aiming at characterizing the response of LS-SCLC, and in particular the α/β value and proliferation parameters. Through a systematic analysis of the literature, we collated a dataset containing 57 entries (3363 patients) of response of LS-SCLC treated with CRT. Radiotherapy schedules ranged from hyper- to hypofractionation. Four radiobiological models to describe the OS2y were investigated, with progressive levels of complexity including the effect of radiotherapy, chemotherapy, treatment year and toxicity. The Akaike Information Criterion (AIC) was used to compare models, and the profile likelihood methodology to compute confidence intervals. Model 4, which includes the effect of radiotherapy, chemotherapy, treatment year and dose-dependent toxicity, provided the best fit to the data (lowest AIC value). Despite being the best model, model 4 still fails to provide a good prediction of the OS2y, in particular for the schedules achieving the lowest and highest survival. The radiobiological analysis of the dose-response of LS-SCLC to CRT does not allow the response parameters to be narrowly constrained. We attribute this limitation to the large heterogeneity of this disease. Nonetheless, our analysis shows a large α/β value (>9 Gy, 95% CI), which implies a low fractionation effect in the radiotherapy of LS-SCLC, and an accelerated proliferation of tumor cells, λ' > 1.6 Gy/day (95% CI), after a kick-off time of ~4-5 weeks, which supports the use of accelerated protocols to avoid the effect of tumor proliferation on the clinical outcome.
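As a reminder of how AIC-based model comparison of this kind works, here is a minimal sketch using the standard definition AIC = 2k - 2 ln L, where k is the number of fitted parameters and L the maximized likelihood. The candidate names, log-likelihoods and parameter counts are invented for illustration and are not values from the study.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood

# Hypothetical fits: (maximized log-likelihood, number of parameters).
# These numbers are made up and do not come from the abstract.
fits = {
    "candidate_a": (-120.0, 2),
    "candidate_b": (-112.0, 3),
    "candidate_c": (-110.5, 4),
    "candidate_d": (-104.0, 5),
}

scores = {name: aic(ll, k) for name, (ll, k) in fits.items()}
best = min(scores, key=scores.get)

# Delta-AIC relative to the best candidate; larger values mean less support.
delta = {name: score - scores[best] for name, score in scores.items()}
```

The penalty term 2k is what lets a more complex model win only if its likelihood gain outweighs the extra parameters. A lowest AIC identifies the best of the candidates compared, not necessarily a model that predicts well in absolute terms.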
Tahir, W.; Shamshoian, J.; Tauber, J.; Clinton, L. K.; Griffin, M.; Shah, C.; Singh, G.; Fahy, D.; Sucipto, K.; Brosnan-Cashman, J.; Altepeter, T. A.; Bhattacharya, S.; Crandall, W.; Duan, C.; Gale, J. D.; Gupta, V.; Haarmann, H.; Harpaz, N.; Hooper, A. T.; Horowitz, J.; Hurtado-Lorenzo, A.; Hussaini, B. E.; Jairath, V.; Jones, A.; Kostiuk, B.; Kukreja, A.; Laroux, F. S.; Lissoos, T.; McBride, R. B.; Najdawi, F.; Nayyar, A.; Osterman, M. T.; Panchal, P.; Ruane, D.; Travis, S.; Visvanathan, S.; Wilson, L.; Jayson, C.
In clinical trials for ulcerative colitis (UC), pathologists assess disease severity through standardized histological indices, including the Geboes Score, Robarts Histopathology Index (RHI), and Nancy Histologic Index (NHI). Despite strong associations with clinical outcomes, histologic scoring suffers from inter- and intra-reader variability, and consensus criteria for histologic remission remain uncertain. Through a consortium approach, we developed an artificial intelligence-based measurement (AIM) tool for scoring histology in UC mucosal biopsies (AIM-HI UC). This model, trained on a large dataset of UC biopsies (N=10,230), utilizes additive multiple instance learning models leveraging PLUTO, a pathology foundation model, that predict each of the Geboes subgrades, from which the Geboes grade-level score, RHI, and NHI can be calculated. Evaluation of this model on a standalone verification set including clinical trial specimens established algorithm non-inferiority and/or superiority relative to standard qualified pathologists through comparison of algorithm-consensus and pathologist-consensus agreement metrics (non-inferior if difference >-0.1, superior if difference >0, inclusive of confidence intervals). AIM-HI UC was determined to be non-inferior to pathologists (N=3) for the prediction of all seven Geboes subgrades, grade-level Geboes, RHI, NHI, histologic improvement (GS<3.1), 2A histologic remission (GS<2A.0), and 2B histologic remission (GS<2B.0). AIM-HI UC was superior to pathologists for several Geboes subgrades (GS 0, GS 1, GS 2B, and GS 5), as well as grade-level Geboes, RHI, and positive percent agreement of 2A histologic remission. The model was shown to be greater than 99% repeatable for all histologic scoring metrics examined. 
Model-derived scores were shown to strongly correlate with canonical histologic features of inflammation, including the proportion of total epithelium that is inflamed (Spearman r=0.83, p<0.01), the proportion of neutrophils localized within crypt epithelium (Spearman r=0.83, p<0.01), and the amount of mucosal area classified as erosion or ulceration (Spearman r=0.80, p<0.01). Overall, these results suggest that AIM-HI UC has the potential to improve consistency of UC histology interpretation, providing a path toward standardization of UC histology scoring in clinical trials.
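The non-inferiority and superiority criteria quoted in this abstract can be written as a small decision rule. The sketch below assumes that "inclusive of confidence intervals" means the lower confidence bound of the agreement difference (algorithm-consensus minus pathologist-consensus) must clear the threshold. That reading, the function name and the example bounds are assumptions, not details confirmed by the abstract.

```python
def classify_agreement(ci_lower, margin=-0.1):
    """Classify an agreement difference from its lower confidence bound.

    ci_lower: lower CI bound of (algorithm-consensus agreement minus
              pathologist-consensus agreement).
    margin:   non-inferiority margin; -0.1 is the value quoted in the abstract.
    """
    if ci_lower > 0:
        return "superior"
    if ci_lower > margin:
        return "non-inferior"
    return "non-inferiority not shown"

# Hypothetical lower bounds, for illustration only.
examples = {
    "endpoint_x": 0.02,   # whole interval above 0
    "endpoint_y": -0.04,  # above the margin, but includes 0
    "endpoint_z": -0.15,  # crosses the margin
}
results = {name: classify_agreement(lo) for name, lo in examples.items()}
```

Under this reading every superior endpoint is also non-inferior, which matches the abstract's report of non-inferiority on all endpoints and superiority on a subset.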
Komolafe, O. O.; Roberts, A. C.; Shelley, J.; Tawiah, A. K.
High-quality, domain-specific datasets are foundational to advancing educational tools and AI systems in healthcare, yet assembling case repositories from real-world clinical records faces substantial privacy, ethical, and licensing barriers. Synthetic data generation offers a compelling pathway forward, but educational cases require rigorous validation to ensure clinical plausibility and pedagogical utility. This pilot study introduces PhysiCase, a dual-layer validation pipeline for synthetic case generation, and evaluates the feasibility of combining automated LLM-based screening with expert educator review. We generated 128 synthetic musculoskeletal (MSK) cases using four frontier large language models (GPT-4.1, GPT-4o, Google Gemini 2.5 Pro, and Llama 4 Scout) across 28 clinical conditions. Cases underwent automated quality screening using an "LLM-as-judge" framework (DeepEval) assessing prompt alignment, JSON correctness, answer relevance, bias, toxicity, and completeness. Ninety cases (70.3%) passed automated filtering and proceeded to expert evaluation by four MSK physiotherapy educators, who rated medical accuracy, realism, fidelity, relevance, and usability on 5-point Likert scales. GPT-4.1 demonstrated the highest automated pass rate (96%) and strongest expert ratings (medical accuracy 4.10/5, usability 4.38/5), while Llama 4 Scout showed the lowest pass rate (33.3%) and the lowest expert ratings. Expert-evaluated cases achieved strong content validity indices for usability (97.5%), relevance (97.5%), and realism (95%), though medical accuracy showed greater variance (CVI 87.5%). Cross-layer correlation analysis revealed that automated completeness metrics moderately aligned with expert usability ratings, while answer relevance and prompt alignment showed weak or negative correlations with clinical correctness. Qualitative analysis identified three primary failure modes: reductive logic, biomechanical inconsistency, and administrative/contextual gaps.
The dual-layer validation framework proved methodologically viable: automated screening efficiently reduced expert review burden, while human judgment remained indispensable for detecting subtle clinical reasoning failures. LLM-generated synthetic cases have the potential to meet practical educational needs for MSK physiotherapy, but expert validation is essential to safeguard clinical accuracy. These findings support a scalable division of labour for synthetic case development, with targeted improvements to prompting and automated reasoning checks needed to address identified "nuance gaps." The code for this paper is available at https://github.com/kwid-ai/PhysiCase
Serrano, A. E.
Machine learning (ML) has emerged as a transformative technology across biomedical and life science sectors, with applications spanning drug discovery, medical imaging, genomics, and clinical decision support (Goecks et al., 2020; Patel et al., 2020). Despite exponential growth in ML-related publications, from fewer than 100 articles in 2003 to nearly 25,000 by 2021 (NCBI, 2022), adoption among industry professionals remains uneven and sector-dependent. Understanding what drives or inhibits this adoption is critical for organisations seeking to leverage ML capabilities in research and clinical practice. Technology adoption in organisational contexts has been extensively studied through the Technology Acceptance Model (TAM), originally proposed by Davis (1989) and subsequently extended to incorporate external variables influencing perceived usefulness (PU) and perceived ease of use (PEU) (Venkatesh & Davis, 1996). While TAM has been applied across multiple industries, its application within biomedical and life science contexts remains limited, and the industry-specific factors that shape ML acceptance in this sector have not been systematically examined. Two external variables are particularly relevant to life science professionals. First, the bibliometric journal impact factor (JIF) functions as a cognitive signal of scientific credibility in a sector where evidence-based decision-making is culturally embedded and publication quality serves as a proxy for technological legitimacy (Garfield, 1996). Second, technology hype, operationalised through the Gartner Hype Cycle framework, represents a social influence variable that shapes organisational expectations and investment decisions around emerging technologies (Gartner Inc., 2018). Whether these variables influence ML acceptance among life science professionals, alongside individual knowledge and experience, has not been empirically tested.
This study addresses that gap by investigating ML technology acceptance among 213 biomedical and life science professionals across EMEA, LATAM, and North America, using a cross-sectional quantitative survey and PLS-SEM analysis. TAM is extended with three external variables (JIF, technology hype, and prior knowledge and experience) to test their influence on PU and PEU in this specific professional context. Additionally, the study examines demographic and regional differences in ML acceptance, with particular attention to variation between academic researchers and healthcare professionals. The findings contribute a validated, sector-specific extension of TAM for life sciences, provide actionable insights for organisations seeking to accelerate ML implementation, and establish a framework for future subsector-specific research.
Shimada, T.; Kodera, S.; Sawano, S.; Guan, J.; Saitoh, W.; Wakasa, S.; Ito, S.; Yanagishita, T.; Hayashi, Y.; Shibata, A.; Ito, A.; Otsuka, K.; Higashikuni, Y.; Okamura, H.; Tsujita, K.; Node, K.; Yamaguchi, O.; Makimoto, H.; Kabutoya, T.; Imai, Y.; Nakayama, M.; Sato, H.; Fujita, H.; Kohro, T.; Matoba, T.; Takeda, N.; Fukuda, D.; Nagai, R.
Background: Aortic stenosis (AS) is a progressive valvular disease associated with poor prognosis once symptoms develop, yet routine echocardiographic screening is impractical. While artificial intelligence (AI)-based electrocardiogram (ECG) models have shown promise for AS detection, it remains unclear whether they primarily reflect conventional left ventricular hypertrophy (LVH) voltage criteria or capture additional ECG features. Methods and Results: We developed a deep learning model using 244,816 ECGs from 51,713 patients across six academic institutions in Japan (CLIDAS database). AS labels were derived from inpatient Diagnosis Procedure Combination (DPC) codes. The model achieved an area under the receiver operating characteristic curve (AUC) of 0.849 (95% confidence interval 0.832-0.865) in the independent test cohort, with consistent performance across institutions, sex, and age. At a threshold of 0.1, sensitivity was 79.1%, specificity was 73.9%, and negative predictive value (NPV) was 98.0%. Conventional LVH voltage criteria (Sokolow-Lyon AUC 0.706; Cornell AUC 0.692) showed lower performance, and adding them to the AI model conferred no incremental benefit (AUC 0.849 vs. 0.847). Gradient-weighted class activation mapping (Grad-CAM) revealed predominant attention around QRS complexes in limb leads, beyond regions typically assessed in LVH evaluation. Conclusions: This multicenter AI-ECG model demonstrated strong discrimination for AS and captured ECG features beyond conventional LVH voltage criteria. The high NPV supports its use as a rule-out pre-screening tool.
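Because the rule-out argument in this abstract rests on the negative predictive value, it helps to see how NPV depends on disease prevalence for a fixed sensitivity and specificity. The sketch below applies the standard Bayes relation to the reported operating point (sensitivity 79.1%, specificity 73.9%). The prevalence values in the loop are hypothetical inputs, since the abstract does not state the test-cohort prevalence.

```python
def npv(sensitivity, specificity, prevalence):
    """Negative predictive value from sensitivity, specificity and prevalence."""
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    return true_neg / (true_neg + false_neg)

SENSITIVITY = 0.791  # reported in the abstract at a threshold of 0.1
SPECIFICITY = 0.739  # reported in the abstract at a threshold of 0.1

# Hypothetical prevalences (not from the abstract), to show the trend.
npv_by_prevalence = {
    prev: npv(SENSITIVITY, SPECIFICITY, prev)
    for prev in (0.01, 0.05, 0.10, 0.20)
}

for prev, value in npv_by_prevalence.items():
    print(f"prevalence {prev:.0%}: NPV {value:.3f}")
```

Back-solving the same relation for the reported NPV of 98.0% gives a prevalence of roughly 6.7%. This is an inference from the rounded figures, not a value reported in the abstract. NPV falls as prevalence rises, so the rule-out performance would need re-checking in populations with a different AS prevalence.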
Diaz, F. C.; Waldrup, B.; Carranza, F. G.; Manjarrez, S.; Velazquez-Villarreal, E.
Background: Pancreatic ductal adenocarcinoma (PDAC) is characterized by extensive molecular complexity, profound stromal remodeling, and limited responsiveness to systemic therapies. Although gemcitabine-based regimens remain widely utilized, the molecular pathways that influence treatment-associated biological variation are incompletely understood. The TGFβ and JAK/STAT signaling networks are recognized regulators of tumor progression, immune modulation, and therapeutic resistance; however, their genomic architecture in clinically stratified PDAC populations remains poorly defined. Methods: We employed a conversational artificial intelligence-driven analytical framework to investigate TGFβ and JAK/STAT pathway alterations in a cohort of 184 PDAC patients. Clinical and molecular data were integrated to generate age- and treatment-stratified cohorts, enabling pathway-level and gene-level analyses according to gemcitabine exposure. Findings generated through AI-assisted interrogation were subsequently evaluated using conventional statistical approaches. Results: TGFβ pathway alterations were identified in approximately one-quarter to one-third of tumors across clinical subgroups and demonstrated relatively stable frequencies regardless of age at diagnosis or gemcitabine treatment status. Gene-level analyses revealed that pathway disruption was predominantly driven by recurrent alterations in SMAD4, with additional low-frequency events involving TGFBR1 and TGFBR2. Notably, TGFBR2 mutations were significantly more frequent among late-onset PDAC patients receiving gemcitabine compared with untreated late-onset patients (8.8% vs. 1.4%; p = 0.04), suggesting a potential treatment-associated enrichment. In contrast, JAK/STAT pathway alterations were rare throughout the cohort, with only isolated mutations observed in pathway components including JAK1, JAK2, JAK3, STAT1, STAT3, and related regulatory genes.
No significant differences in JAK/STAT alteration frequencies were identified according to age or treatment exposure. Conclusions: TGFβ and JAK/STAT pathways exhibit distinct genomic architectures in PDAC. TGFβ pathway disruption represents a recurrent feature of disease biology, largely driven by SMAD4 alterations, while TGFBR2 enrichment in gemcitabine-treated late-onset tumors suggests a potential context-specific association worthy of further investigation. Conversely, genomic alterations within the JAK/STAT pathway are uncommon, indicating that pathway activity may be regulated predominantly through non-genomic mechanisms. These findings demonstrate the utility of conversational artificial intelligence agents for rapid, scalable, and clinically contextualized pathway interrogation and support future studies integrating multi-omic data to refine precision medicine strategies in PDAC.
Collier, A.
Background Electronic health record documentation patterns may reflect workflow complexity, monitoring intensity, and operational strain in intensive care settings. However, documentation-derived features can be sensitive to local documentation culture, data capture systems, and outcome definitions. Retrospective validation across multiple datasets is therefore needed before these signals are used in workflow intelligence or clinical AI governance tools. Objective To evaluate whether documentation-density and documentation-timing features show reproducible retrospective signal for ICU workflow complexity and long-stay proxy outcomes across de-identified critical care datasets, while distinguishing workflow and long-stay associations from unsupported claims about mortality prediction, burden reduction, or deployment readiness. Methods We synthesized retrospective validation results from de-identified ICU and workflow datasets generated through a prespecified documentation-density validation program. Feature families included Documentation Burden Score style features, Shift-End Documentation Rate style features, documentation reliability style metadata, and all-documentation feature sets where available. Outcomes included long ICU length of stay proxies, mortality where available, and workflow proxy endpoints. Models compared baseline feature sets with enhanced models containing documentation-density or workflow features. Performance was summarized using area under the receiver operating characteristic curve, Brier score where reported, delta AUROC, bootstrap confidence intervals where reported, and label-shuffle controls where available. Results The strongest external long-stay proxy evidence came from the NWICU chartevents analysis, which included 28,612 ICU stays, 20,267 stays with chart events, and 9,619,759 chart events. For ICU length of stay greater than the median, baseline AUROC was 0.5252. 
Enhanced AUROC was 0.9512 for Documentation Burden Score features, 0.9214 for Shift-End Documentation Rate features, 0.8470 for documentation reliability style features, and 0.9517 for all documentation features. Corresponding label-shuffle enhanced AUROCs were near random, ranging from 0.4897 to 0.5064. For ICU length of stay greater than the 75th percentile, baseline AUROC was 0.5155. Enhanced AUROC was 0.9433 for Documentation Burden Score features, 0.9194 for Shift-End Documentation Rate features, 0.8118 for documentation reliability style features, and 0.9427 for all documentation features, with label-shuffle enhanced AUROCs from 0.4836 to 0.4999. Additional retrospective support was observed in eICU workflow analyses, HiRID first-24-hour documentation-density analyses, MIMIC-IV HF ICU internal analyses, MIMIC-IV-Note metadata extensions, and nursing-chart or lab density proxy analyses. However, cross-institution discrimination transfer was weak without recalibration, and several analyses remained proxy validations rather than final clinical validations. Conclusions Documentation-density and documentation-timing features show promising retrospective signal for ICU workflow complexity and long-stay proxy outcomes, especially in NWICU chartevents and selected internal dataset-specific analyses. These findings support further preregistered, prospective, silent-mode validation of documentation-derived workflow intelligence. They do not establish prospective clinical performance, mortality reduction, clinician burden reduction, autonomous deterioration prediction, or deployment readiness.
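The label-shuffle controls mentioned in this abstract check that a high AUROC disappears once the link between features and outcome is broken. The sketch below shows a simplified version of that idea on synthetic data: it computes a rank-based AUROC, then re-scores the same predictions against permuted labels. The data, sample size and noise level are invented, and the study's own controls may have retrained models on shuffled labels, which this sketch does not do.

```python
import random

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative
    (ties count one half). O(n_pos * n_neg), fine for small samples."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic data: 200 stays, half positive; the score is the label plus noise.
labels = [i % 2 for i in range(200)]
scores = [y + rng.gauss(0.0, 0.5) for y in labels]

observed = auroc(labels, scores)

# Label-shuffle control: permute the labels and re-score the same predictions.
shuffled_aurocs = []
for _ in range(100):
    permuted = labels[:]
    rng.shuffle(permuted)
    shuffled_aurocs.append(auroc(permuted, scores))

mean_shuffled = sum(shuffled_aurocs) / len(shuffled_aurocs)
```

A shuffled AUROC near 0.5 rules out some pipeline artifacts. It does not rule out features that encode the outcome itself. Documentation volume, for instance, plausibly grows with time spent in the ICU, so a near-random shuffle control is compatible with a length-of-stay proxy being partly circular.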